# AGTR Multiclass Classifier Evaluation

Bounds on the precision, recall, accuracy, and error rate of a multiclass classifier can be provably found using an approximate ground truth refinement, or AGTR. This package provides an implementation of this evaluation framework for malware classifiers, specifically classifiers that categorize malware samples in the PE file format by family.


## Installation

```
pip install jsonlines
pip install pefile
git clone https://github.com/knowmalware/pehash.git
cd pehash
python ./setup.py build
sudo -HE python ./setup.py install
```


## Usage
```
usage: pehash_agtr.py [-h] [--agtr-jsonl AGTR_JSONL] [--agtr-dir AGTR_DIR]
                      [--num-processes NUM_PROCESSES] [--verbose]
                      pred_jsonl

positional arguments:
  pred_jsonl            The path to the .jsonl file containing the predicted
                        labels. Format: {"md5": md5_val, "label": label_val}

optional arguments:
  -h, --help            show this help message and exit
  --agtr-jsonl AGTR_JSONL
                        The path to a .jsonl file containing the AGTR cluster
                        labels. Format: {"md5": md5_val, "pehash": pehash_val}
  --agtr-dir AGTR_DIR   A directory containing the files whose labels were
                        predicted.
  --num-processes NUM_PROCESSES
                        The maximum number of processes.
  --verbose
```
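
The `pred_jsonl` file holds one JSON object per line, in the format given in the help text above. As a minimal sketch of how such a file could be produced, the snippet below writes predictions with Python's standard `json` module. The helper name `write_pred_jsonl`, the MD5 values, and the family labels are hypothetical placeholders and are not part of this package.

```python
import json


def write_pred_jsonl(predictions, out_path):
    """Write (md5, label) pairs as one JSON object per line.

    `predictions` is an iterable of (md5_val, label_val) string pairs.
    This helper is illustrative only; it is not shipped with pehash_agtr.py.
    """
    with open(out_path, "w", encoding="utf-8") as f:
        for md5_val, label_val in predictions:
            record = {"md5": md5_val, "label": label_val}
            f.write(json.dumps(record) + "\n")


# Placeholder MD5 digests and made-up family labels, for illustration only.
EXAMPLE_PREDICTIONS = [
    ("d41d8cd98f00b204e9800998ecf8427e", "family_a"),
    ("0cc175b9c0f1b6a831c399e269772661", "family_b"),
]
```

Calling `write_pred_jsonl(EXAMPLE_PREDICTIONS, "pred.jsonl")` would produce a file that can be passed as the positional `pred_jsonl` argument.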

## Example
```
python3 pehash_agtr.py ./sample_data/pred.jsonl --agtr-jsonl ./sample_data/agtr.jsonl
```

Output:
```
Precision lower bound: 0.467
Recall upper bound: 0.975
Accuracy upper bound: 0.975
Error rate lower bound: 0.025
```
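
The last two lines of the sample output are consistent with the standard identity error rate = 1 - accuracy, under which an upper bound on accuracy gives a lower bound on error rate. The sketch below only illustrates that arithmetic; the function name is hypothetical, and this is an assumption about how the two values relate, not a description of how `pehash_agtr.py` computes them.

```python
def error_rate_lower_bound(accuracy_upper_bound):
    """Lower bound on error rate implied by an upper bound on accuracy.

    Assumes error rate = 1 - accuracy; illustrative helper only.
    """
    return round(1.0 - accuracy_upper_bound, 3)


# Using the accuracy upper bound from the sample output above.
print(f"Error rate lower bound: {error_rate_lower_bound(0.975):.3f}")
```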

## Full dataset

The full VirusShare dataset we use can be obtained from John Seymour's Google Drive folder: https://drive.google.com/drive/folders/0B_IN6RzP69b2WC1wUjNqajYxRXM

## Distribution

This code in its current form is provided for review purposes. If the paper is accepted, we plan to publish the code under a chosen license on our company's GitHub page. We are not publishing it on GitHub now because, if the paper is not accepted, we would want to submit it to a double-blind venue, and a public repository would break anonymity for future submissions.